PDF Extraction: Retrieving Text and Tables together using Python

When working with PDF files, extracting both text and tables can be challenging due to their complex structure. However, the “pdfplumber” library offers a powerful solution for this task. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.

Understanding the Approach

The method involves extracting table objects and text lines separately and then combining them based on their positional values. This ensures that the extracted data maintains the correct order and structure as it appears in the PDF. Let’s break down the code and logic step-by-step.
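To make the idea concrete, here is a minimal sketch that does not use pdfplumber at all. It assumes each extracted item is a plain dict with a hypothetical `top` key (distance from the top of the page) and a `text` key. These names and the sample values are illustrative only.

```python
# Toy illustration of position-based merging (not pdfplumber's API).
# Each item carries a hypothetical "top" coordinate measured from the page top.
text_lines = [
    {"top": 50.0, "text": "Hello"},
    {"top": 70.0, "text": "World"},
    {"top": 300.0, "text": "Lorem ipsum"},
]
tables = [
    {"top": 120.0, "text": "| Name | Age |\n|:-----|----:|\n| Eli  |  23 |"},
]


def merge_by_position(lines, tables):
    """Interleave text lines and tables in top-to-bottom reading order."""
    merged = sorted(lines + tables, key=lambda item: item["top"])
    return "\n".join(item["text"] for item in merged)


merged_text = merge_by_position(text_lines, tables)
print(merged_text)
```

The table lands between "World" and "Lorem ipsum" because its `top` value falls between theirs. The walkthrough in this article gets the same effect by giving the Markdown table the position of the table's first character.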

We will use the sample PDF below as an example. It contains both tables and text across multiple pages.

sample_pdf (hosted on drive.google.com)

Figure: Preview of a page from the sample PDF containing text and tables

Prerequisites

Before running the code, make sure the necessary libraries are installed. Besides pdfplumber and pandas, we also need the tabulate library. pandas uses it to convert DataFrame objects to Markdown format, which is how the extracted tables are rendered here. This conversion helps maintain the structure and readability of table data extracted from the PDF.

Installing Required Libraries

You can install these libraries using pip. Run the following command:

    pip install pdfplumber pandas tabulate

Step-by-Step Explanation

1. Import Libraries:

First, we import all the necessary libraries.

- pdfplumber is used for extracting text and tables from PDFs.
- pandas is used for handling and manipulating data.
- extract_text, get_bbox_overlap, and obj_to_bbox are utility functions from pdfplumber.
- tabulate helps in converting data into Markdown format.

    import pdfplumber
    import pandas as pd
    from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox
    import tabulate

2. Function Definition and PDF Opening:

- The function process_pdf takes pdf_path as an argument, which is the path to the PDF file.
- pdfplumber.open(pdf_path) opens the PDF file.
- all_text is initialized as an empty list to store the extracted text from all pages.

    def process_pdf(pdf_path):
        pdf = pdfplumber.open(pdf_path)
        all_text = []

3. Iterate Over Pages:

- for page in pdf.pages: the for loop iterates over each page in the PDF.
- filtered_page: initially set to the current page.
- chars: captures all characters on the filtered_page.

        for page in pdf.pages:
            filtered_page = page
            chars = filtered_page.chars

4. Table Detection and Filtering:

- for table in page.find_tables(): the for loop iterates over each table found on the page.
- first_table_char: stores the first character of the cropped table area.
- filtered_page: updated by filtering out characters that overlap with the table's bounding box, using get_bbox_overlap and obj_to_bbox.

            for table in page.find_tables():
                first_table_char = page.crop(table.bbox).chars[0]
                filtered_page = filtered_page.filter(lambda obj:
                    get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
                )
                chars = filtered_page.chars
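The filtering step can be illustrated without a PDF. The sketch below uses a simplified stand-in for the overlap check (named `bbox_overlap` here, a hypothetical helper rather than pdfplumber's own implementation) and hand-written character dicts. Bounding boxes are assumed to be `(x0, top, x1, bottom)` tuples, and boxes that merely touch are treated as non-overlapping, which may differ from the library's edge-case behavior.

```python
def bbox_overlap(a, b):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None."""
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x1 > x0 and bottom > top:
        return (x0, top, x1, bottom)
    return None


def char_bbox(char):
    """Build a bounding-box tuple from a character dict."""
    return (char["x0"], char["top"], char["x1"], char["bottom"])


# Made-up table area and characters, for illustration only.
table_bbox = (50, 100, 300, 200)
chars = [
    {"text": "H", "x0": 50, "top": 40, "x1": 58, "bottom": 52},    # above the table
    {"text": "N", "x0": 60, "top": 110, "x1": 68, "bottom": 122},  # inside the table
    {"text": "L", "x0": 50, "top": 220, "x1": 58, "bottom": 232},  # below the table
]

# Keep only the characters that do not overlap the table area.
outside = [c for c in chars if bbox_overlap(char_bbox(c), table_bbox) is None]
print([c["text"] for c in outside])
```

Only the characters outside the table survive, which is what lets the table text be replaced by a single Markdown block later on.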

5. Extract and Convert Table to Markdown:

- table.extract() extracts the table content.
- A DataFrame df is created from the extracted table data.
- The first row is set as the header using df.columns = df.iloc[0].
- The rest of the DataFrame is converted to Markdown format and stored in markdown.

                df = pd.DataFrame(table.extract())
                df.columns = df.iloc[0]
                markdown = df.drop(0).to_markdown(index=False)
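For readers who want to see what this step produces without installing pandas or tabulate, the following is a hand-rolled sketch of the same idea: treat the first row as the header and emit a Markdown table. The helper name `rows_to_markdown` is hypothetical, the sample rows are made up, and the spacing will differ from what `DataFrame.to_markdown` prints. It also turns `None` cells, which extracted tables can contain for empty cells, into empty strings.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a Markdown table, using rows[0] as the header."""
    # Replace missing cells (None) with empty strings and stringify the rest.
    cleaned = [["" if cell is None else str(cell) for cell in row] for row in rows]
    header, body = cleaned[0], cleaned[1:]
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


# Made-up rows in the list-of-lists shape that the walkthrough feeds to pandas.
sample_rows = [
    ["First name", "Last name", "Age"],
    ["Nobita", "Nobi", "15"],
    ["Eli", None, "23"],
]
markdown = rows_to_markdown(sample_rows)
print(markdown)
```

This is not a replacement for the pandas route used in the article. It only shows the header/body split and the shape of the resulting string.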

6. Append Markdown to Characters:

A copy of first_table_char with its text replaced by the Markdown content is appended to chars.

                chars.append(first_table_char | {"text": markdown})
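The `|` operator between two dicts was added in Python 3.9. It builds a new dict in which keys from the right-hand side win. The sketch below uses a made-up character dict to show that only `text` changes while the positional keys are kept, and it includes the `{**a, ...}` spelling that also works on earlier Python 3 versions.

```python
# A made-up character dict; real pdfplumber chars carry more keys.
first_table_char = {"text": "F", "x0": 50.0, "top": 100.0, "x1": 58.0, "bottom": 112.0}
markdown = "| Name | Age |"

merged = first_table_char | {"text": markdown}          # Python 3.9+
merged_compat = {**first_table_char, "text": markdown}  # earlier Python 3 versions

# The text is swapped, the position is preserved, and the original is untouched.
print(merged["text"], merged["top"], first_table_char["text"])
```

Because the merged dict keeps the first character's coordinates, the text extraction step places the whole Markdown table where the table started on the page.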

7. Extract Page Text:

- extract_text(chars, layout=True) extracts the text from the filtered characters with layout preservation.
- The extracted text page_text is appended to all_text.

            page_text = extract_text(chars, layout=True)
            all_text.append(page_text)

8. Close PDF and Return Text:

- The PDF file is closed using pdf.close().
- The extracted text from all pages is joined into a single string with newline characters and returned.

        pdf.close()
        return "\n".join(all_text)

9. Execute Function and Print Result:

- The path to the PDF file is defined in pdf_path.
- process_pdf(pdf_path) is called to process the PDF and extract text.
- The extracted text is printed.

    # Path to your PDF file
    pdf_path = r"sample_pdf.pdf"
    extracted_text = process_pdf(pdf_path)
    print(extracted_text)

Complete Code

Here is the complete script for extracting text and tables as markdown from a PDF:

    import pdfplumber
    import pandas as pd
    from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox

    def process_pdf(pdf_path):
        pdf = pdfplumber.open(pdf_path)
        all_text = []

        for page in pdf.pages:
            filtered_page = page
            chars = filtered_page.chars

            for table in page.find_tables():
                first_table_char = page.crop(table.bbox).chars[0]
                filtered_page = filtered_page.filter(lambda obj:
                    get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
                )
                chars = filtered_page.chars

                df = pd.DataFrame(table.extract())
                df.columns = df.iloc[0]
                markdown = df.drop(0).to_markdown(index=False)

                chars.append(first_table_char | {"text": markdown})

            page_text = extract_text(chars, layout=True)
            all_text.append(page_text)

        pdf.close()
        return "\n".join(all_text)

    # Path to your PDF file
    pdf_path = r"sample_pdf.pdf"
    extracted_text = process_pdf(pdf_path)
    print(extracted_text)

Output:

Hello

World

| First name | Last name | Age | City        |
|:-----------|:----------|----:|:------------|
| Nobita     | Nobi      |  15 | Tokyo       |
| Eli        | Shane     |  23 | Orlando     |
| Rahul      | Jain      |  22 | Los Angeles |
| Lucy       | Carlyle   |  17 | London      |
| Anthony    | Lockwood  |  19 | Leicester   |

Loreum ipsum

dolor sit amet,

consecteturadipiscing

Hello

Python

| First name | Last name | Address             |
|:-----------|:----------|:--------------------|
| James      | Watson    | 221 B, Baker Street |
| Mycroft    | Holmes    | Diogenes Club       |
| Irene      | Adler     | 21 New Jersey       |
| Lucy       | Carlyle   | 33 Claremont Square |
| Anthony    | Lockwood  | 35 Portland Row     |

Neque porroquisquam est qui

dolorem

ipsum quia dolor sit amet,

consectetur, adipiscivelit..."

Conclusion

This approach provides a systematic way to extract and combine text and tables from PDFs using “pdfplumber”. By leveraging table and text line positional values, we can maintain the integrity of the original document’s layout. Credits to cmdlineuser and jsvine for their insightful discussion and innovative solution to the problem!

That’s all for now! Hope this tutorial was helpful. Feel free to explore and adapt this method to fit your specific needs.
